Snapchat is one of the most popular social media apps in the world, so it is no surprise that many political ads run on the service each year. Snap Inc.'s political ads library is part of an effort by the company to increase transparency in its advertising practices. The data analyzed in this project spans 2019-2020 and consists of information on every political ad run on the service in that timeframe, including who bought the ad, how much it cost, and which areas it targeted.
The main dataset used in this project was created by combining the library's datasets for 2019 and 2020. The start and end dates were converted to standard Pandas datetime format, and any spend amounts not in USD were converted to USD with forex_python to allow for more meaningful statistical analyses. To focus on targeted advertisements, the dataset was filtered down to ads run in the United States that targeted at least one specific state.
The filtered dataset consisted of 1,175 rows, compared with the original 5,432. Ads run in the United States without state-level targeting are irrelevant to these analyses: their spend would have to be distributed across all 50 states, making that data effectively worthless. To work with the data on a state-by-state level, the dataset also had to be reshaped, since an ad targeting multiple states listed all of them in a single row. Each such ad was split into one row per targeted state, with its spend amount divided evenly among them.
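The even-split reshaping can be illustrated on a hypothetical two-ad dataset (the ads are made up; this sketch uses `explode`, a tidier equivalent of the `stack`-based approach used in the code below):

```python
import pandas as pd

# Toy ads (hypothetical): one targets two states, one targets a single state
ads = pd.DataFrame({
    'Spend': [100.0, 50.0],
    'Regions (Included)': ['California,Texas', 'Vermont'],
})

# Divide each ad's spend evenly across the states it targets
ads['SpendState'] = ads['Spend'] / ads['Regions (Included)'].str.split(',').str.len()

# One row per (ad, state) pair
per_state = (
    ads.assign(State=ads['Regions (Included)'].str.split(','))
       .explode('State')[['State', 'SpendState']]
)
print(per_state)
```

The $100 ad becomes two rows of $50 each, and the single-state ad is unchanged.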
Comparing raw ad spend across states would be misleading, as states with larger populations naturally attract more spending. Spending in each state was therefore normalized by its population, using population data from the Census Bureau, and expressed as dollars spent per 100,000 residents.
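A toy illustration of why per-capita normalization matters (all figures hypothetical):

```python
# Hypothetical figures: raw spend favors the populous state,
# but per-capita spend (dollars per 100,000 residents) reverses the picture
raw_spend = {'BigState': 1_000_000.0, 'SmallState': 50_000.0}
population = {'BigState': 40_000_000, 'SmallState': 600_000}

per_100k = {s: raw_spend[s] / population[s] * 100_000 for s in raw_spend}
print(per_100k)  # BigState: 2500.0, SmallState: ~8333.33
```

BigState receives twenty times the raw dollars, yet SmallState sees more than three times the spend per resident.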
To ascertain the nature of the missingness in the original dataset, the Segments column was examined, as Snapchat offers no explanation for why this particular column would be missing. This column contains "advertiser-specific data used such as Snap Audience Match or Lookalike audiences." Permutation tests were run against the first 15 columns of the dataset, using total variation distance as the test statistic, to see whether the missingness of Segments depends on any of them. With α = 0.05, Segments was determined to be missing at random, dependent on the CandidateBallotInformation column (p = 0.02). This makes sense: ads explicitly supporting a candidate for political office would likely already have data on individuals, which would appear in Segments. In contrast, the missingness of Segments is likely not influenced by Impressions (p = 0.87), so the number of impressions likely has no effect on whether Segments is missing. Note that no data relevant to ad targeting is missing: an empty targeting column simply means the ad was not targeted at that level.
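The test statistic here, total variation distance (TVD), measures how far apart two categorical distributions are. A minimal sketch with made-up proportions:

```python
import numpy as np
import pandas as pd

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return np.abs(np.asarray(p) - np.asarray(q)).sum() / 2

# Made-up category proportions of some column, split by whether Segments is missing
missing = pd.Series([0.7, 0.2, 0.1])   # distribution when Segments is null
present = pd.Series([0.4, 0.4, 0.2])   # distribution when Segments is present
print(tvd(missing, present))  # ~0.3
```

A TVD of 0 means the two conditional distributions are identical; larger values suggest the column's values differ depending on whether Segments is missing.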
Finally, it was observed through geospatial plotting that Vermont seemed to have an abnormally high amount of ad dollars spent relative to its population. To further look into this, a question was posed of whether or not Vermont is disproportionately advertised to, relative to its population and other states. Specifically:
H0: The null hypothesis is that Vermont does not have an unusually high amount of money spent on ads targeted to it; any perceived abnormality is due to random chance.
Ha: The alternative hypothesis is that a disproportionately high amount of money is indeed spent on advertising to Vermont.
The test statistic was the normalized amount of money spent on Snapchat political ads targeted to Vermont. For the hypothesis test, the column of normalized spend amounts was repeatedly shuffled, and the amount assigned to Vermont in each simulation was recorded. The p-value was then the proportion of simulated values at least as large as the observed value. With α = 0.05 the null hypothesis was rejected (p = 0.02). While this does not mean the alternative hypothesis can be accepted, it does mean that the distribution of ad dollars to Vermont is not wholly random.
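The shuffling procedure can be sketched on toy data (the spend values and Vermont's position are made up; only the mechanics mirror the real test below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized spend values; suppose the last entry is Vermont's
spend = np.array([10.0, 12.0, 9.0, 11.0, 30.0])
vt_pos = 4
obs = spend[vt_pos]

# Shuffle the spend column and record what lands on Vermont each time
sims = np.array([rng.permutation(spend)[vt_pos] for _ in range(5000)])

# p-value: fraction of shuffles at least as extreme as the observed value
pval = np.mean(sims >= obs)
print(pval)  # ~0.2 here, since only one of the five values is >= 30
```

With real data the observed value would rarely be matched by chance, yielding a small p-value.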
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import folium
import plotly.graph_objects as go
from forex_python.converter import CurrencyRates
c = CurrencyRates()
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # Higher resolution figures
df19 = pd.read_csv('2019.csv')
df20 = pd.read_csv('2020.csv')
#concatenate both years together
sc = pd.concat([df19,df20],ignore_index=True)
#convert dates to datetime
sc['StartDate'] = pd.to_datetime(sc['StartDate'])
sc['EndDate'] = pd.to_datetime(sc['EndDate'])
#convert non-USD currencies into USD
converted = sc[sc['Currency Code'] != 'USD'].apply(
    lambda row: c.convert(row['Currency Code'], 'USD', float(row['Spend'])),  # float, not int, to keep cents
    axis=1
)
sc.loc[sc['Currency Code'] != 'USD', 'Spend'] = converted
sc = sc.drop('Currency Code', axis=1)
sc.head()
What percentage of each column is missing? It turns out some fields appear to be mandatory, with 0% missing, while others are missing quite frequently. This is in line with the readme file.
sc.isnull().mean() * 100
There appears to be a wide range of amounts spent in general.
sc['Spend'].plot(title='Spending on Snapchat Ads')
#Filter the dataset to only include ads targeted to states in the US
us = sc[sc['CountryCode'] == 'united states']
region = us[us['Regions (Included)'].notna()].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
#This distribution appears identical to the global data
us['Spend'].plot(title='Spending on Snapchat Ads in the US')
State-targeted ads seem to have lower spending in general.
region['Spend'].plot(title='Spending on State-targeted Snapchat Ads in the US')
region['Spend'].describe()
We now look at ads with respect to the people posting them.
(sns.countplot(region['OrganizationName']).set_title('Number of Ads by Organization'))
region['OrganizationName'].describe()
region.groupby('OrganizationName')['Spend'].mean().sort_values(ascending=False).head()
#The highest average spenders
plt.figure(figsize=(10,10))
sns.boxplot(region['Spend'],region['OrganizationName']).set_title('Spending by Organization')
region.pivot_table(
    index='Regions (Included)',
    columns='OrganizationName',
    values='Spend',
    aggfunc='sum',
    fill_value=0
).plot(figsize=(20,10), title='Company Ad Spending by Region')
plt.axis('off')
region.pivot_table(
    index='StartDate',
    columns='Regions (Included)',
    values='Spend',
    aggfunc='sum',
    fill_value=0
).plot(title='Spending in regions over time', legend=False, figsize=(15,10))
#Get the spending by state by splitting each state into its own row
#Divide spend by the number of states it's being spent on
region['SpendState'] = region['Spend'] / region['Regions (Included)'].apply(lambda x: len(str(x).split(',')))
df2 = pd.DataFrame()
df2 = region['Regions (Included)'].str.split(',').apply(pd.Series)
df2.index = region.SpendState
df2 = df2.stack().reset_index('SpendState')
plt.figure(figsize=(40,10))
sns.countplot(df2[0]).set_title('Number of Ads by State')
df2[0].describe()
state_spending = df2.groupby(0)['SpendState'].sum()
state_spending.head()
#spending by state
pop = pd.read_csv('pop.csv') #census data from the Census Bureau
pop['State'] = pop['State'].str.strip('.')
pop['Population'] = pop['Population'].str.replace(',','').astype(float)
pop = pop.set_index('State')
df = pd.DataFrame(state_spending)
df = df.join(pop) #add population to the dataframe
df['SpendState_norm'] = (df['SpendState']/df['Population']) * 100000
#so we can normalize spending for each state
sns.scatterplot(df['Population'],df['SpendState']).set_title('Spending relative to population')
sns.scatterplot(df['Population'],df['SpendState_norm']).set_title('Normalized spending')
The raw spending has a predictable linear relationship while the normalized spending is relatively more constant.
df.sort_values(by='SpendState_norm',ascending=False).head()
It would appear Vermont is a hotspot.
fig = go.Figure(data=go.Choropleth(
    locations=np.array(pd.Series(df.index).map(us_state_abbrev)),  # us_state_abbrev: state name -> two-letter code, defined elsewhere
    z=df['SpendState_norm'],
    locationmode='USA-states',
    colorscale='mint',
    autocolorscale=False,
    marker_line_color='white',
    colorbar_title="USD"
))
fig.update_layout(
    title_text='Snapchat Political Ad Spending by State (Normalized)',
    geo=dict(
        scope='usa',
        projection=go.layout.geo.Projection(type='albers usa'),
        showlakes=True,  # lakes
        lakecolor='rgb(255, 255, 255)'
    ),
)
fig.show()
Here we perform permutation tests to determine whether the missingness of Segments depends on any of the first 15 columns of the data. α = 0.05
col = 'Segments'
for x in sc.columns[:15]:
    # observed distribution of column x, conditional on whether Segments is missing
    distr = (
        sc
        .assign(is_null=sc[col].isnull())
        .pivot_table(index='is_null', columns=x, aggfunc='size')
        .apply(lambda row: row / row.sum(), axis=1)
    )
    n_repetitions = 100
    tvds = []
    for _ in range(n_repetitions):
        # shuffle the current column
        shuffled_col = (
            sc[x]
            .sample(replace=False, frac=1)
            .reset_index(drop=True)
        )
        # put the shuffled column in a table
        shuffled = (
            sc
            .assign(**{
                x: shuffled_col,
                'is_null': sc[col].isnull()
            })
        )
        #total variation distance
        shuffled = (
            shuffled
            .pivot_table(index='is_null', columns=x, aggfunc='size')
            .apply(lambda row: row / row.sum(), axis=1)
        )
        tvd = shuffled.diff().iloc[-1].abs().sum() / 2
        tvds.append(tvd)
    obs = distr.diff().iloc[-1].abs().sum() / 2
    pval = np.mean(np.array(tvds) > obs)  # wrap in an array so the comparison is elementwise
    print(x, pval)
With α = 0.05, we can say that Segments is missing at random, dependent on the CandidateBallotInformation column (p = 0.02). This makes sense, as ads explicitly supporting a candidate for political office would likely already have data on individuals, which would appear in Segments. In contrast, the missingness of Segments is likely not influenced by Impressions (p = 0.87), so the number of impressions likely has no effect on whether Segments is missing.
Is Vermont disproportionately targeted by Snapchat political ads, in terms of money spent advertising?
H0: The null hypothesis is that Vermont does not have an unusually high amount of money spent on ads targeted to it; any perceived abnormality is due to random chance.
Ha: The alternative hypothesis is that a disproportionately high amount of money is indeed spent on advertising to Vermont.
Our test statistic will be the total amount of money (normalized) spent on Snapchat political ads targeted to Vermont from 2019-2020. α = 0.05
We will use the normalized data, because shuffling the raw amounts of money spent in each state is biased by the state's population. Using the normalized data eliminates this issue.
df.head()
df.loc['Vermont']
obs = df.loc['Vermont', 'SpendState_norm']
stats = []
for _ in range(5000):
    #shuffle the amounts
    shuffled_col = (
        df['SpendState_norm']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    stats.append(shuffled_col[list(df.index).index('Vermont')])
np.count_nonzero(np.array(stats) >= obs) / len(stats)  #p-value is less than the significance level
pd.Series(stats).hist(bins = 20)
plt.scatter(obs, 0, color='red', s=30);
We can see from the graph above that the observed result is seen very few times in the data we generated.
With α = 0.05 the null hypothesis is rejected (p = 0.02). While this does not mean the alternative hypothesis can be accepted, it does mean that the distribution of ad dollars to Vermont is not wholly random.